[SPARK-39741][SQL] Support url encode/decode as built-in function and tidy up url-related functions#37113
[SPARK-39741][SQL] Support url encode/decode as built-in function and tidy up url-related functions#37113yikf wants to merge 2 commits intoapache:masterfrom yikf:url
Conversation
|
cc: @cloud-fan , I'm not sure why we don't support url encode\decode function as built-in function, but if we intentionally don't, i will close this title. |
There was a problem hiding this comment.
I think the file name should be urlExpressions
|
@yikf Please run |
thanks~ i will address those comment if this pr is necessary |
|
do other databases provide these functions? |
Both Trino & Presto have URL-related URL functions, including url_decode/url_encode, reference doc: https://trino.io/docs/current/functions/url.html, However, I can't find a function like url_decode in Hive or mysql, etc. maybe we have to use reflect function instead in hive, It's a bit of a hassle, And often these functions are useful. |
There was a problem hiding this comment.
We can make them extend RuntimeReplaceable, and use StaticInvoke as the real expression.
There was a problem hiding this comment.
okay, please take a look when you have a time, thanks
|
Can one of the admins verify this patch? |
|
@yikf Could you update the pr description because I found this pr also includes some refactoring work |
done |
| // scalastyle:off line.size.limit | ||
| @ExpressionDescription( | ||
| usage = """ | ||
| _FUNC_(str) - Translates a string into {@code application/x-www-form-urlencoded} format using a specific encoding scheme. |
There was a problem hiding this comment.
Does {@code application/x-www-form-urlencoded} really work? Can you run DESC FUNCTION url_encode and give a screenshot?
There was a problem hiding this comment.
This means it does not work... The SQL doc can't interpret {@code ...}. Can we just use plain string?
.... into 'application/x-www-form-urlencoded' format ...
There was a problem hiding this comment.
Sorry i misread this meaning, updated
cloud-fan
left a comment
There was a problem hiding this comment.
LGTM except for a minor comment
|
|
||
| object UrlCodec { | ||
| def encode(src: UTF8String, enc: UTF8String): UTF8String = { | ||
| UTF8String.fromString(URLEncoder.encode(src.toString, enc.toString)) |
There was a problem hiding this comment.
We use java.net.URLEncoder now.
I wonder if org.apache.commons.codec.net.URLCodec and java.net.URLEncoder have the same semantics?
Will org.apache.commons.codec.net.URLCodec have better performance?
There was a problem hiding this comment.
I write a micro bench as follows:
def testEncode(url: String, valuesPerIteration: Int): Unit = {
import org.apache.commons.codec.net.URLCodec
val benchmark = new Benchmark("Test encode", valuesPerIteration, output = output)
benchmark.addCase("Use java.net.URLEncoder") { _: Int =>
for (_ <- 0L until valuesPerIteration) {
java.net.URLEncoder.encode(url, "UTF-8")
}
}
val codec = new URLCodec()
benchmark.addCase("Use org.apache.commons.codec.net.URLCodec") { _: Int =>
for (_ <- 0L until valuesPerIteration) {
codec.encode(url)
}
}
benchmark.run()
}
def testDecode(value: String, valuesPerIteration: Int): Unit = {
import org.apache.commons.codec.net.URLCodec
val benchmark = new Benchmark("Test decode", valuesPerIteration, output = output)
benchmark.addCase("Use java.net.URLEncoder") { _: Int =>
for (_ <- 0L until valuesPerIteration) {
java.net.URLDecoder.decode(value, "UTF-8")
}
}
val codec = new URLCodec()
benchmark.addCase("Use org.apache.commons.codec.net.URLCodec") { _: Int =>
for (_ <- 0L until valuesPerIteration) {
codec.decode(value)
}
}
benchmark.run()
}
override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
val valuesPerIteration = 100000
// Test Contains
testEncode("https://spark.apache.org", valuesPerIteration)
testDecode("https%3A%2F%2Fspark.apache.org", valuesPerIteration)
}and run it use GA, the result as follows:
Java 8
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test encode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 32 33 1 3.1 324.3 1.0X
Use org.apache.commons.codec.net.URLCodec 29 29 0 3.4 291.4 1.1X
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test encode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 53 54 1 1.9 530.0 1.0X
Use org.apache.commons.codec.net.URLCodec 37 37 0 2.7 365.5 1.5X
Java 11
OpenJDK 64-Bit Server VM 11.0.15+10-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test encode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 25 28 2 4.0 251.6 1.0X
Use org.apache.commons.codec.net.URLCodec 22 25 1 4.6 219.5 1.1X
OpenJDK 64-Bit Server VM 11.0.15+10-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test encode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 31 31 1 3.3 306.2 1.0X
Use org.apache.commons.codec.net.URLCodec 26 29 2 3.8 260.4 1.2X
Java 17
OpenJDK 64-Bit Server VM 17.0.3+7-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Test encode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 31 32 1 3.2 308.3 1.0X
Use org.apache.commons.codec.net.URLCodec 26 29 2 3.8 262.6 1.2X
OpenJDK 64-Bit Server VM 17.0.3+7-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Test encode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 40 44 3 2.5 402.0 1.0X
Use org.apache.commons.codec.net.URLCodec 31 33 1 3.2 311.0 1.3X
``` ~~
`org.apache.commons.codec.net.URLCodec` seems a little quick than `java.net.URLEncoder` from benchmark result, so should we change to use `org.apache.commons.codec.net.URLCodec` in this pr? @cloud-fan @Yikf
There was a problem hiding this comment.
Sorry, the above bench test encode twice, need re-run, please ignore it
There was a problem hiding this comment.
Update result:
Java 8
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
Test encode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 40 41 1 2.5 404.9 1.0X
Use org.apache.commons.codec.net.URLCodec 36 38 1 2.8 358.5 1.1X
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
Test decode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 38 39 1 2.6 379.0 1.0X
Use org.apache.commons.codec.net.URLCodec 31 32 2 3.2 312.8 1.2X
Java 11
OpenJDK 64-Bit Server VM 11.0.15+10-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test encode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 26 26 0 3.9 256.5 1.0X
Use org.apache.commons.codec.net.URLCodec 24 24 1 4.2 240.5 1.1X
OpenJDK 64-Bit Server VM 11.0.15+10-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test decode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 22 22 0 4.6 217.5 1.0X
Use org.apache.commons.codec.net.URLCodec 18 18 0 5.6 179.0 1.2X
Java 17
OpenJDK 64-Bit Server VM 17.0.3+7-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Test encode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 29 33 2 3.4 290.1 1.0X
Use org.apache.commons.codec.net.URLCodec 31 35 2 3.3 305.3 1.0X
OpenJDK 64-Bit Server VM 17.0.3+7-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Test decode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 22 26 2 4.5 223.1 1.0X
Use org.apache.commons.codec.net.URLCodec 25 29 2 4.0 252.3 0.9X
There was a problem hiding this comment.
Seems like java.net.URLEncoder is the future (Java 17)? Assuming that Spark will get to Java 17 eventually, I'd vote for java.net.URLEncoder to avoid any breaking change when we want to switch to java.net.URLEncoder in the future.
There was a problem hiding this comment.
Thanks @LuciferYang for benchmark test result, Seems like the benchmark did not diff much and for the reason that Spark will bump JDK and avoid any breaking change, java.net.URLEncoder is a better fit, what do you think @LuciferYang , @cloud-fan
|
@LuciferYang @cloud-fan Any other comments please? |
|
friendly ping @cloud-fan |
|
thanks, merging to master! |
|
thank you all |

What changes were proposed in this pull request?
Currently, Spark don't support url encode/decode as built-in functions, the user might use reflect instead, It's a bit of a hassle, And often these functions are useful.
This pr aims to two points as follow:
Why are the changes needed?
url encode/decode functions are useful
Does this PR introduce any user-facing change?
yes, add new function as built-in function
How was this patch tested?
add new tests